This paper presents improvements to a combined solution for noise 
estimation and speech enhancement in digital hearing aids in the time 
domain. The study focuses on single-channel statistical temporal speech 
enhancement using adaptive Wiener filtering. In this technique, the noise 
estimate is updated based on the short-term uncleaned signal to noise 
threshold ratio (ST-USNTR) of each frame. The technique works best when the 
background noise level is low compared with that of the speech of interest. 
Time-domain algorithms are used in order to account for the time-varying 
nature of the speech signal. The performance of the proposed algorithm is 
evaluated on a speech signal with seven types of noise at three 
signal-to-noise ratio (SNR) levels for each noise type. The results show that 
a basic level of adaptive speech enhancement is obtained using only the 
statistical parameters of the noisy speech, without the need for a reference input. 
1. INTRODUCTION 


The performance of most speech-processing systems depends on the quality of the input 
signal. In applications such as hearing aids [1], [2], people with sensorineural hearing 
loss need a higher SNR than people with normal hearing to reach the same level of intelligibility. 
In real-world scenarios, however, background noise interferes with the intended signal and 
degrades device performance, so the signal of interest must be enhanced for the device to 
perform well. A vast variety of algorithms and methodologies exists for signal enhancement, and 
the problem becomes one of selecting the algorithm and methodology that performs best for a 
particular application. The following criteria influence the selection of a speech 
enhancement algorithm for a particular application. 

The first criterion is the type of application: coding or speech-to-speech. Speech coding 
applications require complete noise elimination and are less concerned with speech distortion, i.e., they 
need a high SNR with tolerable intelligibility. Most speech-to-speech applications need high-quality 
speech, i.e., a higher SNR together with good quality. The second criterion is the position of the device: 
fixed or portable. If the device is fixed in a known place, for example a room or a car, the noise is easier to 
estimate, which simplifies speech enhancement. The third criterion is the number of 
microphones used in the device. With an array of directional microphones, noise is easy to eliminate if the 
direction of the intended speech or of the noise is known. The problem becomes critical for a single-
microphone device, where there is no reference input for the noise or the desired signal. In such devices, the 
noise must be estimated from the given noisy input, and the appropriate gain is then calculated to enhance the 
noisy speech. In addition to the above criteria, the noise spectrum also influences the selection of the algorithm [3], [4]. 
If the noise is broadband or high-frequency, it can be suppressed easily by passing the signal through a filter. 
The problem arises when the noise spectrum lies within the spectrum of the intended speech. Enhancing a 
speech signal corrupted by low-frequency noise using a single channel is therefore a challenge. 

Throughout the process of speech enhancement there are two parameters to balance: the amount of 
noise reduction and the amount of speech distortion, i.e., the SNR and the intelligibility of the enhanced 
speech signal. A higher SNR is therefore valuable only if it comes without loss of intelligibility in the enhanced 
speech. Chen et al. [5] described three ways to manage this trade-off: using a priori knowledge of the signal, 
using an array of microphones, or making suitable changes to the Wiener filter. In this research, the third option, 
modified Wiener filtering, is followed for better management of speech distortion. 

In the literature, most voice activity detection (VAD) techniques for speech enhancement have 
been implemented in the spectral domain [6]-[10]. The accuracy of a VAD is an important factor, and it 
depends on the decision rule employed. Many approaches have been proposed in the spectral domain to 
increase detection accuracy. Some of these techniques are based on long-term spectral features that 
distinguish speech from noise, such as the long-term spectral flatness measure (LTSF) [6], long-term 
spectral divergence (LTSD) [7], and long-term signal variability (LTSV) [8]. 

The remaining techniques in the literature are based on short-term features, either in the time 
domain or in the frequency domain. Sohn et al. [9] proposed a VAD that employs the decision-directed 
parameter estimation method for the likelihood ratio test. Jo et al. [10] proposed a VAD that employs a 
support vector machine (SVM) as the decision function using the likelihood ratios (LRs), whereas conventional 
techniques perform VAD by comparing the geometric mean of the LRs with a given threshold value. Shin et al. [11] 
presented a VAD based on the conditional maximum a posteriori probability (MAP) criterion, which exploits the 
voice activity decision of the previous frame along with that of the current frame when estimating the probability 
of voice presence in the current frame. It outperforms the conventional VAD by using two separate 
thresholds for the likelihood ratio test (LRT), which result from the temporal correlation between the current 
and previous frames. Upadhyay et al. [12] proposed a recursive noise estimation algorithm in which the noise 
power is updated from its present and previous values with the help of a smoothing parameter. The 
filter transfer function is adapted from sample to sample based on the speech signal statistics (the local 
mean and the local variance), and the method is implemented in the frequency domain. The proposed algorithm is 
similar to the work in [12]; the major difference lies in the domain of computation: the existing work is 
carried out in the spectral domain, whereas the proposed work is carried out entirely in the time domain. 

Xiao et al. [13] developed time-domain speech enhancement using a generative adversarial network 
(GAN) to improve the performance of the generator and also compared the different GANs available for speech 
enhancement. Tan et al. [14] proposed an end-to-end multi-task model for VAD that increases the 
robustness of the VAD system in low-SNR conditions. Tejaswini et al. [15] compared the different approaches 
available in both the frequency domain and the time domain. They discussed the steps involved in speech 
enhancement using MATLAB and explained the mathematical operations involved in the Fourier 
transform, windowing, averaging, and finding the variance and the minimum mean square error. Zhao et al. [16] 
projected the noisy speech into a speech-dominated subspace and a noise-dominated subspace and fed these to 
encoders to detect the speech and the noise separately. 

In the present research, the noise estimation and the corresponding gain calculations are performed 
in the time domain. Time-domain and frequency-domain techniques each have their own advantages and 
disadvantages [17]. Since transformation techniques are complicated to implement in digital hearing aids, a 
simple and effective algorithm is developed for noise estimation and speech enhancement in the time domain for 
low-cost, real-time applications. The proposed time-domain technique has three main advantages: i) it 
accommodates the time-varying nature of the speech signal [18], ii) it reduces the number of computations [19], 
and iii) it avoids the unpleasant signal distortion that arises in spectral-domain techniques because of an invalid 
short-time Fourier transform (STFT) [20]. 

This paper is organized as follows. Section 2 presents the proposed algorithm, and Section 2.1 
explains the block diagram of the proposed work. Section 3 discusses the results, and 
Section 4 gives the conclusions. 
2. PROPOSED VAD BASED ON ST-USNTR 

In speech processing applications, a conventional VAD is used to separate the incoming signal into 
voiced, unvoiced and silence frames. The VAD procedure changes slightly in speech enhancement 
applications, since the input signal is corrupted by background noise. It is therefore important to choose a 
feature that separates the incoming speech frames well in the presence of background noise. To improve 
accuracy, the proposed VAD separates the incoming noisy speech into three groups of frames based on 
the ST-USNTR, which is defined as the ratio between the short-term temporal energy (STTE) of the incoming 
noisy signal and the noise threshold (NT). The three groups are noise frames with STTE < NT (i.e., silence 
frames), noise-dominated speech frames with STTE < 1.5 × NT, and speech-dominated frames. The update of NT 
for each group is given in (1). 


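$$
NT = \begin{cases}
NT_{\text{avg}}, & \text{USNTR} < 0\,\text{dB} \\
NT_{\text{smooth}}, & 0\,\text{dB} < \text{USNTR} < 0.176\,\text{dB} \\
NT_{\text{previous}}, & 0.176\,\text{dB} < \text{USNTR}
\end{cases}
\tag{1}
$$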
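
The three-way grouping described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, STTE is assumed to be the mean squared amplitude of a frame (the text does not give its formula), and the boundaries are written as the energy ratios STTE ≤ NT and STTE < 1.5 × NT quoted in the text, so the sketch does not depend on how the dB values are computed.

```python
def short_term_energy(frame):
    """STTE of one frame, assumed here to be the mean squared amplitude."""
    return sum(s * s for s in frame) / len(frame)


def classify_frame(stte, nt, speech_ratio=1.5):
    """Assign a frame to one of the three groups used by the VAD.

    stte         -- short-term temporal energy of the noisy frame
    nt           -- current noise threshold (NT)
    speech_ratio -- STTE/NT ratio above which speech dominates (1.5 in the text)
    """
    if stte <= nt:
        return "noise"                    # silence frame: STTE does not exceed NT
    if stte < speech_ratio * nt:
        return "noise_dominated_speech"   # speech present, noise still dominant
    return "speech_dominated"             # clearly above the noise threshold
```

Comparing energies directly avoids taking a logarithm of very small values, which can matter on low-cost fixed-point hardware.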


This division is important because it allows the noise estimate to be updated even when the VAD detects 
noise-dominated speech frames, and it allows noise to be suppressed completely without distorting the speech. 
The flow chart of the proposed algorithm is shown in Figure 1. First, the initial three frames are averaged to 
obtain the reference noise threshold. Frames with an STTE less than or equal to this threshold are considered 
noise frames. These frames, which have negative USNTR values, are given zero gain, and the 
noise threshold is updated for them using the average estimate. Next, frames with an STTE greater than the 
threshold are considered high-USNTR frames. For this category of frames, the Wiener gain (G) from [21], [22] is 
applied, calculated from the mean and the variance of the noisy speech signal. The relation between the applied 
gain and the USNTR is given in (2). 
Figure 1. VAD based on ST-USNTR 


$$
G_i = \begin{cases}
0, & \text{USNTR} < 0\,\text{dB} \\
G, & 0\,\text{dB} < \text{USNTR}
\end{cases}
\tag{2}
$$


For frames with a USNTR between 0 dB and 0.176 dB the noise is also estimated, but a smoothed estimate is 
used instead of the average estimate. Finally, frames with a higher USNTR are given high gain values 
calculated using the same Wiener filter. For these frames the noise threshold cannot be re-estimated, so 
its previous estimate is used in the gain calculation, as in (3). 


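$$
NT_{\text{current}} = \begin{cases}
0.5\,NT_{\text{estimated}} + 0.5\,NT_{\text{previous}}, & \text{USNTR} < 0\,\text{dB} \\
0.3\,NT_{\text{estimated}} + 0.7\,NT_{\text{previous}}, & 0\,\text{dB} < \text{USNTR} < 0.176\,\text{dB} \\
NT_{\text{previous}}, & 0.176\,\text{dB} < \text{USNTR}
\end{cases}
\tag{3}
$$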
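
The update rule in (3) translates directly into code. The sketch below is illustrative only: the function name is hypothetical, and it assumes that the "estimated" noise threshold of a frame is simply that frame's STTE, which the text does not state explicitly. The group boundaries are again expressed as energy ratios (STTE ≤ NT and STTE < 1.5 × NT).

```python
def update_noise_threshold(nt_previous, stte, speech_ratio=1.5):
    """Return the new noise threshold after observing one frame.

    nt_previous  -- noise threshold before this frame
    stte         -- short-term temporal energy of the current noisy frame,
                    used here as the frame's noise estimate (an assumption)
    speech_ratio -- STTE/NT ratio above which the frame is speech dominated
    """
    if stte <= nt_previous:
        # Noise (silence) frame: average of new estimate and previous value.
        return 0.5 * stte + 0.5 * nt_previous
    if stte < speech_ratio * nt_previous:
        # Noise-dominated speech frame: smoothed update, weighted to the past.
        return 0.3 * stte + 0.7 * nt_previous
    # Speech-dominated frame: no reliable noise estimate, keep previous value.
    return nt_previous
```

Weighting the previous value more heavily in the middle group limits how far speech energy can leak into the noise estimate while still letting the threshold track slowly varying noise.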


2.1. Block diagram of proposed work 

The block diagram of the proposed end-to-end time-domain single-channel speech enhancement is 
shown in Figure 2. The time-domain analysis and synthesis of speech frames are done using overlap 
buffering and overlap-add. In this work, the STTE of each frame is compared with the NT in the time domain. 
Frames with an ST-USNTR less than or equal to 0.176 dB are assumed to be noise-dominated frames and are given 
zero gain. For the remaining frames, a Wiener filter based on first-order statistics is used to 
calculate the gain, because it yields a linear estimate [10] of the original clean speech and minimizes the mean 
squared error between the clean speech and the enhanced speech. 
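
One common time-domain form of a Wiener gain built from the local mean and variance is sketched below. It is offered as an assumption about the kind of filter meant here, not as the gain rule of [21], [22]: the function names are hypothetical, the statistics are computed per frame, and the noise threshold NT is assumed to stand in for the noise variance (reasonable only for zero-mean noise).

```python
def wiener_gain(frame, noise_power):
    """Return (local mean, gain) for one noisy frame.

    The gain is max(var - noise_power, 0) / var, so it tends to 1 when the
    frame variance is far above the noise power and to 0 when it is not.
    """
    n = len(frame)
    mean = sum(frame) / n
    variance = sum((s - mean) ** 2 for s in frame) / n
    if variance <= 0.0:
        return mean, 0.0          # constant frame: nothing to scale
    gain = max(variance - noise_power, 0.0) / variance
    return mean, gain


def enhance_frame(frame, noise_power):
    """Linear estimate of the clean frame: mean + gain * (sample - mean)."""
    mean, gain = wiener_gain(frame, noise_power)
    return [mean + gain * (s - mean) for s in frame]
```

For example, a frame with unit variance and a noise power of 0.25 receives a gain of 0.75, while a frame whose variance is below the noise power is reduced to its local mean.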
Figure 2. Block diagram of time domain single channel adaptive Wiener filtering 


3. RESULTS AND DISCUSSION 

The approach of the proposed speech enhancement algorithm differs from that of existing 
algorithms: it is based entirely on the noise statistics and the noisy input in the time domain. The simulations 
are done in MATLAB using noisy speech signals taken from the noisy speech corpus (NOIZEUS) 
database [23]. Experiments were conducted on one clean speech utterance, “I am the small, salt and tasty”, 
to which seven different types of noise were added, one at a time, at three different SNRs: 0 dB, 5 dB and 10 dB. 
Six of them are real-world noises taken from the AURORA database, namely train, station, restaurant, 
car, airport and babble noise; the seventh is additive white Gaussian noise (AWGN). The noisy input speech from 
the database is divided into 10 ms frames, i.e., 80 samples per frame, using a Hanning window with 50% 
overlap. The first three incoming noisy frames are averaged to obtain the reference noise threshold (NT). 
The incoming noisy frames are then grouped into three categories based on the adaptive noise 
threshold, and each group is given the corresponding gain as explained in Section 2. Figure 
3 shows that the STTE of the estimated noise changes adaptively with that of the incoming noisy frames and that 
the noise (silence) frames are suppressed completely. 
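
The framing step can be illustrated as follows. This is a sketch under assumptions rather than the MATLAB code used for the experiments: the names are hypothetical, a periodic Hann window is assumed so that windows at 50% overlap sum to one, and the initial NT is taken to be the mean STTE (mean squared amplitude) of the first three windowed frames.

```python
import math

FRAME_LEN = 80          # 10 ms at the 8 kHz rate implied by 80 samples/frame
HOP = FRAME_LEN // 2    # 50% overlap


def hann(n_points):
    """Periodic Hann window; shifted copies at half-length hop sum to 1."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / n_points)
            for n in range(n_points)]


def split_frames(signal, frame_len=FRAME_LEN, hop=HOP):
    """Cut the signal into overlapping, Hann-windowed frames."""
    window = hann(frame_len)
    return [[signal[start + i] * window[i] for i in range(frame_len)]
            for start in range(0, len(signal) - frame_len + 1, hop)]


def overlap_add(frames, hop=HOP):
    """Rebuild a signal by summing frames at their original positions."""
    if not frames:
        return []
    out = [0.0] * (hop * (len(frames) - 1) + len(frames[0]))
    for k, frame in enumerate(frames):
        for i, sample in enumerate(frame):
            out[k * hop + i] += sample
    return out


def initial_noise_threshold(frames, count=3):
    """Reference NT: mean energy of the first `count` frames (assumed form)."""
    energies = [sum(s * s for s in f) / len(f) for f in frames[:count]]
    return sum(energies) / len(energies)
```

With this window and hop, every sample except those in the first and last half frame is covered by two windows whose weights add to one, so unmodified frames reconstruct the input exactly in the interior of the signal.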
Figure 3. STTE of incoming noisy frames vs adaptive noise threshold and gain 


Figures 4 and 5 show the temporal and spectral plots, respectively, of (a) the original speech, (b) 
the incoming noisy speech corrupted by AWGN, (c) the speech enhanced using the 
direct subtraction method, and (d) the speech enhanced using the proposed method, plotted in MATLAB with 
the parameters defined in this paper. It can be observed that the proposed method suppresses noise effectively 
with fewer computations than the earlier work in the spectral domain [24]. 


Indonesian J Elec Eng & Comp Sci, Vol. 27, No. 1, July 2022: 131-138 


Indonesian J Elec Eng & Comp Sci ISSN: 2502-4752 


The short-time signal-to-noise ratio (ST-SNR) [25] can be used to evaluate speech enhancement 
algorithms in either the time or the frequency domain. The time-domain evaluation of ST-SNR is one of the 
simplest objective measures used to evaluate speech enhancement applications. Kolbæk et al. [26] 
presented six different types of loss functions to evaluate the performance of end-to-end time-
domain speech enhancement techniques using neural networks. Among them, the short-time mean square 
error (ST-MSE) [26] and the short-time objective intelligibility (STOI) [27] are used along with the ST-SNR 
for objective evaluation of the proposed time-domain speech enhancement for seven types of noise at three 
different SNRs, and the results are summarised in Table 1. The simulations of the proposed method 
show that it gives its best results when the intended signal is corrupted by AWGN only. Frames with 
negative ST-SNR values are omitted from the calculation of ST-SNR. 
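
The two energy-based measures can be sketched as below. The exact definitions used for Table 1 are not given in the text, so this is an assumed form: ST-SNR is taken as the per-frame SNR in dB averaged over frames, with negative-SNR frames omitted as stated above, and ST-MSE as the per-frame mean squared error averaged over frames. Function names, the non-overlapping 80-sample frames and the small `eps` guard are illustrative choices.

```python
import math


def frame_pairs(clean, enhanced, frame_len=80):
    """Yield aligned, non-overlapping frames of the two signals."""
    n = min(len(clean), len(enhanced))
    for start in range(0, n - frame_len + 1, frame_len):
        yield (clean[start:start + frame_len],
               enhanced[start:start + frame_len])


def st_snr(clean, enhanced, frame_len=80, eps=1e-12):
    """Mean per-frame SNR in dB, skipping frames whose SNR is negative."""
    values = []
    for c, e in frame_pairs(clean, enhanced, frame_len):
        signal = sum(x * x for x in c)
        error = sum((x - y) ** 2 for x, y in zip(c, e))
        snr = 10.0 * math.log10((signal + eps) / (error + eps))
        if snr >= 0.0:
            values.append(snr)
    return sum(values) / len(values) if values else float("nan")


def st_mse(clean, enhanced, frame_len=80):
    """Mean over frames of the per-frame mean squared error."""
    errors = [sum((x - y) ** 2 for x, y in zip(c, e)) / frame_len
              for c, e in frame_pairs(clean, enhanced, frame_len)]
    return sum(errors) / len(errors) if errors else float("nan")
```

Omitting negative-SNR frames keeps a few badly estimated frames from dominating the average, but it also means the reported ST-SNR describes only the frames that were retained.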
Figure 4. Time domain plots of: (a) pure speech, (b) AWGN noisy speech, (c) speech enhanced through 
direct subtraction, and (d) speech enhanced through proposed method 

Figure 5. Spectrograms of (a) pure speech, (b) AWGN noisy speech, (c) speech enhanced through direct 
subtraction, and (d) speech enhanced through proposed method 

Table 1. Objective evaluation of proposed method 


Noise type    0 dB                        5 dB                        10 dB
              ST-SNR   STOI    ST-MSE     ST-SNR   STOI    ST-MSE     ST-SNR   STOI    ST-MSE
AWGN          6.68     0.74    0.0255     9.26     0.84    0.0099     12.52    0.88    0.0037
Train         4.01     0.611   0.095      11.09    0.80    0.0224     12.48    0.88    0.0085
Station       6.55     0.649   0.057      10.97    0.83    0.04       14.04    0.9     0.013
Restaurant    4.37     0.712   0.11       6.011    0.84    0.03       13.26    0.887   0.0096
Car           3.98     0.66    0.055      8.81     0.78    0.0027     11.91    0.86    0.0085
Airport       3.73     0.62    0.07       10.47    0.83    0.04       12.36    0.91    0.01
Babble        5.85     0.65    0.066      9.43     0.79    0.032      11.25    0.888   0.0095


For better comparison, frames with very low ST-SNR values are omitted from the calculation of 
ST-SNR. The data in Table 2 show that the proposed method is comparable in performance to 
the previous methods with a minimum number of computations, since it does not involve any 
transformation of the input. The direct subtraction of the estimated noise is also done in the time domain, 
analogous to Boll’s subtraction in the spectral domain. The objective results show that direct subtraction 
performs well in improving ST-SNR but reduces the intelligibility of the enhanced speech. 


Table 2. Comparison of proposed method with existing spectral domain methods 


Noise type: AWGN

Domain            Method               0 dB                        5 dB                        10 dB
                                       ST-SNR   STOI    ST-MSE     ST-SNR   STOI    ST-MSE     ST-SNR   STOI    ST-MSE
Spectral domain   SSBoll               12.46    0.70    0.0276     13.56    0.77    0.0134     14.8     0.8     0.0072
Spectral domain   Wiener Scalart       12.5     0.79    0.0116     13.13    0.81    0.0073     14       0.83    0.0057
Time domain       Direct subtraction   12.7     0.68    0.0289     14.7     0.78    0.0172     15.93    0.84    0.013
Time domain       Proposed method      12       0.74    0.0255     13.9     0.84    0.0099     15.8     0.884   0.0037


4. CONCLUSION 

An end-to-end time-domain speech enhancement algorithm was implemented in this 
research. The proposed VAD based on ST-USNTR divides the incoming noisy speech into three 
types of frames. Since transformation techniques are complicated to implement in digital hearing aids, a 
simple and effective algorithm was developed for noise estimation and speech enhancement in the time domain for 
low-cost, real-time applications. In this work, a time-domain Wiener filter based on first-order statistics 
was used, with a slight change in the gain calculation so that it applies equally well to 
real-world noises other than AWGN. The separation of incoming frames into three types and the 
smoothed update of the noise threshold improved the performance of single-channel speech enhancement in 
hearing aids. 